3 research outputs found

    Treatment of missing data in Bayesian network structure learning : an application to linked biomedical and social survey data

    The authors acknowledge the Research/Scientific Computing teams at The James Hutton Institute and NIAB for providing computational resources and technical support for the “UK’s Crop Diversity Bioinformatics HPC” (BBSRC grant BB/S019669/1), use of which has contributed to the results reported within this paper. Access to this was provided via the University of St Andrews Bioinformatics Unit, which is funded by a Wellcome Trust ISSF award (grants 105621/Z/14/Z and 204821/Z/16/Z). XK was supported by a World-Leading PhD Scholarship from St Leonard’s Postgraduate School of the University of St Andrews. VAS and KK were partially supported by HATUA, The Holistic Approach to Unravel Antibacterial Resistance in East Africa, a three-year Global Context Consortia Award (MR/S004785/1) funded by the National Institute for Health Research, Medical Research Council and the Department of Health and Social Care. KK is supported by the Academy of Medical Sciences, the Wellcome Trust, the Government Department of Business, Energy and Industrial Strategy, the British Heart Foundation, Diabetes UK, and the Global Challenges Research Fund [Grant number SBF004\1093]. KK is additionally supported by the Economic and Social Research Council HIGHLIGHT CPC- Connecting Generations Centre [Grant number ES/W002116/1].

    Background: Availability of linked biomedical and social science data has risen dramatically in past decades, facilitating holistic and systems-based analyses. Among these, Bayesian networks have great potential to tackle complex interdisciplinary problems, because they can easily model inter-relations between variables. They work by encoding conditional independence relationships discovered via advanced inference algorithms. One challenge is dealing with missing data, which is ubiquitous in survey and biomedical datasets. Missing data is rarely addressed in an advanced way in Bayesian networks; the most common approach is to discard all samples containing missing measurements. This can lead to biased estimates. Here, we examine how Bayesian network structure learning can incorporate missing data.

    Methods: We use a simulation approach to compare a commonly used method in frequentist statistics, multiple imputation by chained equations (MICE), with one specific to Bayesian network learning, structural expectation-maximization (SEM). We simulate multiple incomplete categorical (discrete) data sets with different missingness mechanisms, variable numbers, data amounts, and missingness proportions. We evaluate the performance of MICE and SEM in capturing network structure. We then apply SEM combined with community analysis to a real-world dataset of linked biomedical and social data to investigate associations between socio-demographic factors and multiple chronic conditions in the US elderly population.

    Results: We find that applying either method (MICE or SEM) provides better structure recovery than doing nothing, and that SEM in general outperforms MICE. This finding is robust across missingness mechanisms, variable numbers, data amounts, and missingness proportions. We also find that imputed data from SEM is more accurate than from MICE. Our real-world application recovers known inter-relationships among socio-demographic factors and common multimorbidities. This network analysis also highlights potential areas of investigation, such as links between cancer and cognitive impairment, and a disconnect between self-assessed memory decline and standard cognitive impairment measurement.

    Conclusion: Our simulation results suggest that taking advantage of the additional information provided by network structure during SEM improves the performance of Bayesian networks; this might be especially useful for social science and other interdisciplinary analyses. Our case study shows that comorbidities of different diseases interact with each other and are closely associated with socio-demographic factors.

    Postprint. Publisher PDF. Peer reviewed.

    Generation and Analysis of GATA2w/eGFP Human ESCs Reveal ITGB3/CD61 as a Reliable Marker for Defining Hemogenic Endothelial Cells during Hematopoiesis

    The transition from hemogenic endothelial cells (HECs) to hematopoietic stem/progenitor cells (HS/PCs), or endothelial to hematopoietic transition (EHT), is a critical step during hematopoiesis. However, little is known about the molecular determinants of HECs because HECs are difficult to define. We report here the generation of a GATA2w/eGFP reporter in human embryonic stem cells (hESCs) to mark cells expressing GATA2, a critical gene for EHT. We show that during differentiation, functional HECs are almost exclusively GATA2/eGFP+. We then constructed a regulatory network for HEC determination and identified a panel of positive and negative surface markers for discriminating HECs from non-hemogenic ECs. Among them, ITGB3 (CD61) precisely labeled HECs both in hESC differentiation and in embryonic day 10 mouse embryos. These results not only identify a reliable marker for defining HECs, but also establish a robust platform for dissecting hematopoiesis in vitro, which might lead to the generation of HSCs in vitro.